TQ: Add support for alarms in the protocol #8753

andrewjstone · 2025-08-01T23:01:18Z

This builds on #8741

An alarm represents a protocol invariant violation. It's unclear exactly what should be done about these other than recording them and allowing them to be reported upstack, which is what is done in this PR. An argument could be made for "freezing" the state machine such that trust quorum nodes stop working and the only thing they can do is report alarm status. However, that would block the trust quorum from operating at all, and it's unclear if this should cause an outage on that node.

I'm also somewhat hesitant to put the alarms into the persistent state as that would prevent unlock in the case of a sled/rack reboot.

The flip side of only recording alarms is the danger of continuing to operate with an invariant violation. This could be risky, and since we shouldn't ever see these, maybe pausing for a support call is the right thing. TBD, once more work is done on the protocol.
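As a rough illustration of "record and report upstack, but don't persist", here is a minimal sketch. It is not the code in this PR: the `Alarm` variants, the `Node` struct, and the `reboot` helper are all hypothetical names chosen for the example. It only shows the shape of the trade-off: alarms accumulate in volatile state, the node keeps operating, and a reboot clears them so unlock is not blocked.

```rust
use std::collections::BTreeSet;

/// Hypothetical alarm type: one variant per protocol invariant violation.
/// The variant names are illustrative and not taken from the PR.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum Alarm {
    MismatchedConfigurations { epoch: u64 },
    ShareComputationFailed { epoch: u64 },
}

/// Hypothetical node state for this sketch.
#[derive(Debug, Default)]
pub struct Node {
    /// Stand-in for persistent state that survives a reboot.
    committed_epoch: Option<u64>,
    /// Volatile only: recorded alarms are deliberately not persisted.
    alarms: BTreeSet<Alarm>,
}

impl Node {
    /// Record the violation. The state machine is not frozen.
    pub fn raise_alarm(&mut self, alarm: Alarm) {
        self.alarms.insert(alarm);
    }

    /// Report recorded alarms upstack.
    pub fn alarms(&self) -> &BTreeSet<Alarm> {
        &self.alarms
    }

    /// The node can still make progress while alarms are outstanding.
    pub fn commit(&mut self, epoch: u64) {
        self.committed_epoch = Some(epoch);
    }

    pub fn committed_epoch(&self) -> Option<u64> {
        self.committed_epoch
    }

    /// Simulate a sled/rack reboot: only persistent fields survive.
    pub fn reboot(self) -> Node {
        Node { committed_epoch: self.committed_epoch, alarms: BTreeSet::new() }
    }
}
```

Using a set means a repeated violation is recorded once. Whether alarms should instead freeze the node or be persisted is exactly the open question above; the sketch takes no position beyond mirroring the behavior described.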


It's not actually an error to receive a `CommitAdvance` while coordinating for the same epoch. The `GetShare` from the coordinator could have been delayed in the network, and the node that received it may have already committed before the coordinator knew it was done preparing. In essence, the following would happen:

1. The coordinator sends `GetShare` requests for the prior epoch.
2. Enough nodes reply that the coordinator starts sending prepares.
3. Enough nodes ack prepares to commit.
4. Nexus polls and sends commits. Other nodes get those commits, but not the coordinator.
5. A node that hasn't yet received the `GetShare` gets a `CommitAdvance` or sees the `Commit` from Nexus, gets its configuration, recomputes its own share, and commits. It may have been a prior coordinator with delayed deliveries to other nodes of `GetShare` messages.
6. The node that just committed finally receives the `GetShare` and sends back a `CommitAdvance` to the coordinator.

This is all valid, and was similar to a proptest counterexample.
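The consequence for the coordinator can be sketched as follows. This is a simplified model under stated assumptions, not the actual handler: `Coordinator`, `Outcome`, and `on_commit_advance` are hypothetical names, and the real message carries a configuration rather than a bare epoch number. The point it illustrates is that a `CommitAdvance` for the epoch being coordinated ends coordination and commits, without raising an alarm.

```rust
/// Hypothetical result of handling a `CommitAdvance` in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The advertised epoch is not newer than what we already committed.
    Stale,
    /// We adopted the advertised epoch as committed.
    Committed,
}

/// Hypothetical, heavily simplified coordinator state.
#[derive(Debug, Default)]
pub struct Coordinator {
    coordinating_epoch: Option<u64>,
    committed_epoch: Option<u64>,
    alarms: Vec<String>,
}

impl Coordinator {
    pub fn start_coordinating(&mut self, epoch: u64) {
        self.coordinating_epoch = Some(epoch);
    }

    /// Handle a `CommitAdvance` advertising `epoch` as committed.
    pub fn on_commit_advance(&mut self, epoch: u64) -> Outcome {
        // Nothing to learn from an epoch we have already committed.
        if self.committed_epoch.is_some_and(|c| epoch <= c) {
            return Outcome::Stale;
        }
        // Valid case: the epoch we are coordinating (or a later one) has
        // already committed elsewhere. Stop coordinating; raise no alarm.
        if self.coordinating_epoch.is_some_and(|c| c <= epoch) {
            self.coordinating_epoch = None;
        }
        self.committed_epoch = Some(epoch);
        Outcome::Committed
    }

    pub fn is_coordinating(&self) -> bool {
        self.coordinating_epoch.is_some()
    }

    pub fn committed_epoch(&self) -> Option<u64> {
        self.committed_epoch
    }

    pub fn alarms(&self) -> &[String] {
        &self.alarms
    }
}
```

For example, a coordinator that called `start_coordinating(2)` and then receives `on_commit_advance(2)` ends up committed at epoch 2, no longer coordinating, with an empty alarm list.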

This PR builds on #8753.

This is a hefty PR, but it's not as bad as it looks. Over 4k lines of it is in the example log file in the second commit. There's also some moved and unmodified code that I'll point out.

This PR introduces a new test tool for the trust-quorum protocol: tqdb. tqdb is a REPL that takes event traces produced by the `cluster` proptest and uses them for deterministic replay of actions against test state. The test state includes a "universe" of real protocol nodes, a fake Nexus, and fake networks. The proptest and debugging state is shared and contained in `trust-quorum-test-utils`. The debugger provides a variety of functionality, including stepping through individual events, setting breakpoints, snapshotting and diffing states, and viewing the event log itself.

The purpose of tqdb is twofold:

1. Allow for debugging of failed proptests. This is non-trivial in some cases, even with shrunken tests, because the generated actions are high-level and are all generated up front. The actual operations, such as reconfigurations, are derived from these high-level random generations in conjunction with the current state of the system. Therefore the set of failing generated actions doesn't really tell you much. You have to look at the logs and the assertion that fired, and reason about it with incomplete information. Now, for each concrete action taken, we record the event in a log. In the case of a failure, an event log can be loaded into tqdb, with a breakpoint set right before the failure. A snapshot of the state can be taken, and then the failing event can be applied. The diff will tell you what changed and allow you to inspect the actual state of the system. Full visibility into your failure is now possible.
2. The trust quorum protocol is non-trivial. tqdb allows developers to see in detail how the protocol behaves and understand what is happening in certain situations. Event logs can be created by hand (or script) for particularly interesting scenarios and then run through tqdb.

In order to get the diff functionality to work as I wanted, I had to implement `Eq` for types that implemented `subtle::ConstantTimeEq` in both the `gfss` (our secret sharing library) and `trust-quorum` crates. However, it is unknown whether this is safe, since the compiler could break the constant-time guarantees. Therefore, a feature flag was added such that only the `test-utils` and `tqdb` crates are able to use these implementations. They are not used in the production codebase. Feature unification is not at play here because neither `test-utils` nor `tqdb` is part of the product.
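The breakpoint, snapshot, apply, diff workflow described above can be sketched in a few lines. This is a toy model, not tqdb itself: `Event`, `NodeView`, `TestState`, `Replayer`, and `diff` are hypothetical names, and the real test state holds full protocol nodes, a fake Nexus, and fake networks rather than two fields per node.

```rust
/// Hypothetical concrete events, as they might appear in a recorded log.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    Commit { node: usize, epoch: u64 },
    Crash { node: usize },
}

/// Hypothetical per-node view. `Eq` is what makes diffing possible.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct NodeView {
    committed_epoch: Option<u64>,
    crashed: bool,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TestState {
    nodes: Vec<NodeView>,
}

/// Deterministically replays a recorded event log against test state.
pub struct Replayer {
    state: TestState,
    log: Vec<Event>,
    next: usize,
}

fn apply(state: &mut TestState, event: &Event) {
    match *event {
        Event::Commit { node, epoch } => {
            if let Some(n) = state.nodes.get_mut(node) {
                n.committed_epoch = Some(epoch);
            }
        }
        Event::Crash { node } => {
            if let Some(n) = state.nodes.get_mut(node) {
                n.crashed = true;
            }
        }
    }
}

impl Replayer {
    pub fn new(num_nodes: usize, log: Vec<Event>) -> Self {
        let state = TestState { nodes: vec![NodeView::default(); num_nodes] };
        Replayer { state, log, next: 0 }
    }

    /// Apply the next event; returns false once the log is exhausted.
    pub fn step(&mut self) -> bool {
        let Some(event) = self.log.get(self.next) else {
            return false;
        };
        apply(&mut self.state, event);
        self.next += 1;
        true
    }

    /// Run until the event at index `breakpoint` is the next to apply.
    pub fn run_until(&mut self, breakpoint: usize) {
        while self.next < breakpoint && self.step() {}
    }

    pub fn snapshot(&self) -> TestState {
        self.state.clone()
    }
}

/// Indices of nodes whose view differs between two snapshots.
pub fn diff(before: &TestState, after: &TestState) -> Vec<usize> {
    before
        .nodes
        .iter()
        .zip(&after.nodes)
        .enumerate()
        .filter(|(_, (b, a))| b != a)
        .map(|(i, _)| i)
        .collect()
}
```

A typical session in this model is `run_until(failing_index)`, `snapshot()`, `step()`, `snapshot()`, then `diff` on the two snapshots to see which nodes the failing event touched. The diff relies on `Eq` for every type in the state, which is why the description above needed `Eq` implementations for the constant-time types behind a test-only feature flag.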

andrewjstone requested review from sunshowers and plotnick August 1, 2025 23:01

andrewjstone force-pushed the tq-alarms branch 3 times, most recently from 8c5b6bd to ad388eb Compare August 2, 2025 23:39

andrewjstone mentioned this pull request Aug 8, 2025

TQ: Introduce tqdb #8801

Merged

andrewjstone force-pushed the tq-reconfigure branch from cf0b76f to 6ef670f Compare August 27, 2025 15:05

Base automatically changed from tq-reconfigure to main August 27, 2025 19:35

andrewjstone added 3 commits August 27, 2025 21:25

Fix CommitAdvance behavior for expunged nodes

ec4baa6

andrewjstone force-pushed the tq-alarms branch from 9465f5a to ec4baa6 Compare August 27, 2025 21:26

andrewjstone enabled auto-merge (squash) August 27, 2025 22:25

andrewjstone merged commit d4df3f7 into main Aug 27, 2025
16 checks passed

andrewjstone deleted the tq-alarms branch August 27, 2025 23:09
